Faster 64-bit universal hashing using carry-less multiplications

Abstract Intel and AMD support the Carry-less Multiplication (CLMUL)
instruction set in their x64 processors. We use CLMUL to implement an
almost universal 64-bit hash family (CLHASH). We compare this new family
with what might be the fastest almost universal family on x64 processors
(VHASH). We find that CLHASH is at least 60% faster. We also compare
CLHASH with a popular hash function designed for speed (Google's
CityHash). We find that CLHASH is 40% faster than CityHash on inputs
larger than 64 bytes and just as fast otherwise.

Keywords Universal hashing. Carry-less multiplication. 
Finite field arithmetic 


1 Introduction 

Hashing is the fundamental operation of mapping data objects to
fixed-size hash values. For example, all objects in the Java programming
language can be hashed to 32-bit integers. Many algorithms and data
structures rely on hashing: e.g., authentication codes, Bloom filters and
hash tables. We typically assume that given two data objects, the
probability that they have the same hash value (called a collision) is
low. When this assumption fails, adversaries can negatively impact the
performance of these data structures or even create denial-of-service
attacks. To mitigate such problems, we can pick hash functions at random
(henceforth called random hashing).

Random hashing is standard in Ruby, Python and Perl. It is allowed
explicitly in Java and C++11. There are many fast random hash families,
e.g., MurmurHash, Google's CityHash [35], SipHash and VHASH [12].
Cryptographers have also designed fast hash families with strong
theoretical guarantees. However, much of this work predates the
introduction of the CLMUL instruction set in commodity x86 processors.
Intel and AMD added CLMUL and its pclmulqdq instruction to their
processors to accelerate some common cryptographic operations. Although
the pclmulqdq instruction first became available in 2010, its high cost
in terms of CPU cycles (specifically an 8-cycle throughput on pre-Haswell
Intel microarchitectures and a 7-cycle throughput on pre-Jaguar AMD
microarchitectures) limited its usefulness outside of cryptography.
However, the throughput of the instruction on the newer Haswell
architecture is down to 2 cycles, even though it remains a high-latency
operation (7 cycles); see Table 1. Our main contribution is to show that
the pclmulqdq instruction can be used to produce a 64-bit string hash
family that is faster than known approaches while offering stronger
theoretical guarantees.

2 Random Hashing 

In random hashing, we pick a hash function at random from some family,
whereas an adversary might pick the data inputs. We want distinct objects
to be unlikely to hash to the same value. That is, we want a low
collision probability.

We consider hash functions from $X$ to $[0, 2^L)$. An $L$-bit family is
universal [10, 11] if the probability of a collision is no more than
$2^{-L}$. That is, it is universal if

$$P(h(x) = h(x')) \le 2^{-L}$$

* The low-power AMD Jaguar microarchitecture does even better than
Haswell, with a pclmulqdq throughput of 1 cycle and a latency of 3 cycles.

Table 1: Relevant SIMD intrinsics and instructions on Haswell Intel
processors, with latency and reciprocal throughput in CPU cycles per
instruction.

intrinsic             instruction  description                                      latency  rec. thr.
--------------------  -----------  -----------------------------------------------  -------  ---------
_mm_clmulepi64_si128  pclmulqdq    64-bit carry-less multiplication                 7        2
_mm_or_si128          por          bitwise OR                                       1        0.33
_mm_xor_si128         pxor         bitwise XOR                                      1        0.33
_mm_slli_epi64        psllq        shift left two 64-bit integers                   1        1
_mm_srli_si128        psrldq       shift right by x bytes                           1        0.5
_mm_shuffle_epi8      pshufb       shuffle 16 bytes                                 1        0.5
_mm_cvtsi64_si128     movq         64-bit integer as 128-bit reg.                   1        -
_mm_cvtsi128_si64     movq         64-bit integer from 128-bit reg.                 2        -
_mm_load_si128        movdqa       load a 128-bit reg. from memory (aligned)        1        0.5
_mm_lddqu_si128       lddqu        load a 128-bit reg. from memory (unaligned)      1        0.5
_mm_setr_epi8                      construct 128-bit reg. from 16 bytes
_mm_set_epi64x                     construct 128-bit reg. from two 64-bit integers


Table 2: Notation and basic definitions

notation                              definition
------------------------------------  ------------------------------------------------------
$h : X \to \{0, 1, \ldots, 2^L - 1\}$  $L$-bit hash function
universal                             $P(h(x) = h(x')) \le 1/2^L$ for $x \ne x'$
$\varepsilon$-almost universal         $P(h(x) = h(x')) \le \varepsilon$ for $x \ne x'$
XOR-universal                         $P(h(x) = h(x') \oplus c) \le 1/2^L$ for any
                                      $c \in [0, 2^L)$ and distinct $x, x' \in X$
$\varepsilon$-almost XOR-universal     $P(h(x) = h(x') \oplus c) \le \varepsilon$ for any
                                      integer $c \in [0, 2^L)$ and distinct $x, x' \in X$


for any fixed $x, x' \in X$ such that $x \ne x'$, given that we pick $h$
at random from the family. It is $\varepsilon$-almost universal (also
written $\varepsilon$-AU) if the probability of a collision is bounded by
$\varepsilon$. I.e., $P(h(x) = h(x')) \le \varepsilon$, for any
$x, x' \in X$ such that $x \ne x'$. (See Table 2.)


2.1 Safely Reducing Hash Values 

Almost universality can be insufficient to prevent frequent collisions
since a given algorithm might only use the first few bits of the hash
values. Consider hash tables. A hash table might use as a key only the
first $b$ bits of the hash values when its capacity is $2^b$. Yet even if
a hash family is $\varepsilon$-almost universal, it could still have a
high collision probability on the first few bits.

For example, take any 32-bit universal family $\mathcal{H}$, and derive
the new $1/2^{32}$-almost universal 64-bit family by taking the functions
from $\mathcal{H}$ and multiplying them by $2^{32}$:
$h'(x) = h(x) \times 2^{32}$. Clearly, all functions from this new family
collide with probability 1 on the first 32 bits, even though the
collision probability on the full hash values is low ($1/2^{32}$). Using
the first bits of these hash functions could have disastrous consequences
in the implementation of a hash table.
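
The following minimal Python sketch illustrates the construction $h'(x) = h(x) \times 2^{32}$. The 32-bit function `make_h32` (a multiply-shift hash) is only a hypothetical stand-in for "any 32-bit universal family"; it is not taken from the paper, and the names are assumptions made for illustration. "First bits" is read here as the least significant bits.

```python
import random

MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def make_h32(rng):
    """Hypothetical 32-bit hash function, standing in for a member of H."""
    a = rng.randrange(1 << 64) | 1          # random odd 64-bit multiplier
    def h(x):
        # multiply-shift: keep the top 32 bits of the 64-bit product
        return ((a * x) & MASK64) >> 32
    return h

def shifted(h):
    """Derived 64-bit function h'(x) = h(x) * 2^32."""
    return lambda x: h(x) << 32

rng = random.Random(1)
h_prime = shifted(make_h32(rng))
values = [h_prime(x) for x in range(1000)]

# The low 32 bits of every derived hash value are zero, so a hash table
# indexing on those bits would put all keys in the same bucket.
low_bits = {v & MASK32 for v in values}
print(low_bits)
```

Whatever function is plugged in for `make_h32`, the set of low 32-bit patterns contains the single value 0, which is the failure mode described above.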

Therefore, we consider stronger forms of universality.

- A family is $\Delta$-universal if

  $$P(h(x) = h(x') + c \bmod 2^L) \le 2^{-L}$$

  for any constant $c$ and any $x, x' \in X$ such that $x \ne x'$. It is
  $\varepsilon$-almost $\Delta$-universal if
  $P(h(x) = h(x') + c \bmod 2^L) \le \varepsilon$ for any constant $c$
  and any $x, x' \in X$ such that $x \ne x'$.

- A family is $\varepsilon$-almost XOR-universal if

  $$P(h(x) = h(x') \oplus c) \le \varepsilon$$

  for any integer constant $c \in [0, 2^L)$ and any $x, x' \in X$ such
  that $x \ne x'$ (where $\oplus$ is the bitwise XOR). A family that is
  $1/2^L$-almost XOR-universal is said to be XOR-universal.

Given an $\varepsilon$-almost $\Delta$-universal family $\mathcal{H}$ of
hash functions $h : X \to [0, 2^L)$, the family of hash functions

$$\{h(x) \bmod 2^{L'} \mid h \in \mathcal{H}\}$$

from $X$ to $[0, 2^{L'})$ is $2^{L-L'} \times \varepsilon$-almost
$\Delta$-universal. The next lemma shows that a similar result applies to
almost XOR-universal families.

Lemma 1 Given an $\varepsilon$-almost XOR-universal family $\mathcal{H}$
of hash functions $h : X \to [0, 2^L)$ and any positive integer
$L' < L$, the family of hash functions
$\{h(x) \bmod 2^{L'} \mid h \in \mathcal{H}\}$ from $X$ to $[0, 2^{L'})$
is $2^{L-L'} \times \varepsilon$-almost XOR-universal.

Proof For any integer constant $c \in [0, 2^L)$, consider the equation
$h(x) = (h(x') \oplus c) \bmod 2^{L'}$ for $x \ne x'$ with $h$ picked
from $\mathcal{H}$. Pick any positive integer $L' < L$. We have

$$P(h(x) = h(x') \oplus c \bmod 2^{L'}) = \sum_{z \,\mid\, z \bmod 2^{L'} = 0} P(h(x) = h(x') \oplus c \oplus z)$$

where the sum is over $2^{L-L'}$ distinct $z$ values. Because
$\mathcal{H}$ is $\varepsilon$-almost XOR-universal, we have that
$P(h(x) = h(x') \oplus c \oplus z) \le \varepsilon$ for any $c$ and any
$z$. Thus, we have that
$P(h(x) = h(x') \oplus c \bmod 2^{L'}) \le 2^{L-L'} \varepsilon$, showing
the result.

It follows from Lemma 1 that if a family is XOR-universal, then its
modular reductions are XOR-universal as well.

As a straightforward extension of this lemma, we could show that when
picking any $L'$ bits (not only the least significant), the result is
$2^{L-L'} \times \varepsilon$-almost XOR-universal.
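
As a sanity check of Lemma 1, the sketch below enumerates a toy XOR-universal family exhaustively. It assumes the multiplicative family $h_a(x) = a \times x$ in the binary field $GF(2^3)$ built from the irreducible value `0b1011`; this toy field and all function names are choices made for this illustration only, not something used in the paper.

```python
from itertools import product

IRREDUCIBLE = 0b1011  # x^3 + x + 1, irreducible over GF(2)

def gf8_mul(a, b):
    """Multiply a and b in the toy field GF(2^3)."""
    r = 0
    for i in range(3):              # carry-less product, degree at most 4
        if (b >> i) & 1:
            r ^= a << i
    for i in (4, 3):                # clear bits 4 and 3 using the modulus
        if (r >> i) & 1:
            r ^= IRREDUCIBLE << (i - 3)
    return r

def worst_collision_probability(kept_bits):
    """Largest P(h(x) = h(x') XOR c mod 2^kept_bits) over x != x' and c."""
    mask = (1 << kept_bits) - 1
    worst = 0.0
    for x, y, c in product(range(8), repeat=3):
        if x == y:
            continue
        # count the keys a for which the reduced hash values agree up to c
        hits = sum(1 for a in range(8)
                   if ((gf8_mul(a, x) ^ gf8_mul(a, y) ^ c) & mask) == 0)
        worst = max(worst, hits / 8)
    return worst

for kept in (3, 2, 1):
    print(kept, worst_collision_probability(kept))
```

With $L = 3$, keeping $L'$ bits gives a worst-case probability of $2^{L-L'}/2^L = 2^{-L'}$, so the reduced families remain XOR-universal for their smaller range, as the text states.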

2.2 Composition 

It can be useful to combine different hash families to create new ones.
For example, it is common to compose hash families. When composing hash
functions ($h = g \circ f$), the universality degrades linearly: if $g$
is picked from an $\varepsilon_g$-almost universal family and $f$ is
picked (independently) from an $\varepsilon_f$-almost universal family,
the result is $(\varepsilon_g + \varepsilon_f)$-almost universal.

We sketch the proof. For $x \ne x'$, we have that $g(f(x))$ and
$g(f(x'))$ collide if $f(x) = f(x')$. This occurs with probability at
most $\varepsilon_f$ since $f$ is picked from an $\varepsilon_f$-almost
universal family. If not, they collide if $g(y) = g(y')$ where
$y = f(x)$ and $y' = f(x')$, with probability bounded by $\varepsilon_g$.
Thus, we have bounded the collision probability by
$\varepsilon_f + (1 - \varepsilon_f)\varepsilon_g \le \varepsilon_f + \varepsilon_g$,
establishing the result.

By extension, we can show that if $g$ is picked from an
$\varepsilon_g$-almost XOR-universal family, then the composed result
($h = g \circ f$) is $(\varepsilon_g + \varepsilon_f)$-almost
XOR-universal. It is not required for $f$ to be almost XOR-universal.

2.3 Hashing Tuples 

If we have universal hash functions from $X$ to $[0, 2^L)$, then we can
construct hash functions from $X^m$ to $[0, 2^L)^m$ while preserving
universality. The construction is straightforward:
$h'(x_1, x_2, \ldots, x_m) = (h(x_1), h(x_2), \ldots, h(x_m))$. If $h$ is
picked from an $\varepsilon$-almost universal family, then the result is
$\varepsilon$-almost universal. This is true even though a single $h$ is
picked and reused $m$ times.

Lemma 2 Consider an $\varepsilon$-almost universal family $\mathcal{H}$
from $X$ to $[0, 2^L)$. Then consider the family of functions
$\mathcal{H}'$ of the form
$h'(x_1, x_2, \ldots, x_m) = (h(x_1), h(x_2), \ldots, h(x_m))$ from $X^m$
to $[0, 2^L)^m$, where $h$ is in $\mathcal{H}$. Family $\mathcal{H}'$ is
$\varepsilon$-almost universal.

The proof is not difficult. Consider two distinct values from $X^m$,
$x_1, x_2, \ldots, x_m$ and $x'_1, x'_2, \ldots, x'_m$. Because the
tuples are distinct, they must differ in at least one component:
$x_i \ne x'_i$. It follows that $h'(x_1, x_2, \ldots, x_m)$ and
$h'(x'_1, x'_2, \ldots, x'_m)$ collide with probability at most
$P(h(x_i) = h(x'_i)) \le \varepsilon$, showing the result.

2.4 Variable-Length Hashing From Fixed-Length Hashing 

Suppose that we are given a family $\mathcal{H}$ of hash functions that
is XOR universal over fixed-length strings. That is, we have that
$P(h(s) = h(s') \oplus c) \le 1/2^L$ if the length of $s$ is the same as
the length of $s'$ ($|s| = |s'|$). We can create a new family that is XOR
universal over variable-length strings by introducing a hash family on
string lengths. Let $\mathcal{G}$ be a family of XOR universal hash
functions $g$ over length values. Consider the new family of hash
functions of the form $h(s) \oplus g(|s|)$ where $h \in \mathcal{H}$ and
$g \in \mathcal{G}$. Let us consider two distinct strings $s$ and $s'$.
There are two cases to consider.

- If $s$ and $s'$ have the same length so that $g(|s|) = g(|s'|)$, then
  we have XOR universality since

  $$P(h(s) \oplus g(|s|) = h(s') \oplus g(|s'|) \oplus c) = P(h(s) = h(s') \oplus c) \le 1/2^L$$

  where the last inequality follows because $h \in \mathcal{H}$, an XOR
  universal family over fixed-length strings.

- If the strings have different lengths ($|s| \ne |s'|$), then we again
  have XOR universality because

  $$P(h(s) \oplus g(|s|) = h(s') \oplus g(|s'|) \oplus c) = P(g(|s|) = g(|s'|) \oplus (c \oplus h(s) \oplus h(s'))) = P(g(|s|) = g(|s'|) \oplus c') \le 1/2^L$$

  where we set $c' = c \oplus h(s) \oplus h(s')$, a value independent
  from $|s|$ and $|s'|$. The last inequality follows because $g$ is taken
  from a family $\mathcal{G}$ that is XOR universal.

Thus the result ($h(s) \oplus g(|s|)$) is XOR universal. We can also
generalize the analysis. Indeed, if $\mathcal{H}$ and $\mathcal{G}$ are
$\varepsilon$-almost universal, we could show that the result is
$\varepsilon$-almost universal. We have the following lemma.

Lemma 3 Let $\mathcal{H}$ be an XOR universal family of hash functions
over fixed-length strings. Let $\mathcal{G}$ be an XOR universal family
of hash functions over integer values. We have that the family of hash
functions of the form $s \to h(s) \oplus g(|s|)$ where
$h \in \mathcal{H}$ and $g \in \mathcal{G}$ is XOR universal over all
strings.

Moreover, if $\mathcal{H}$ and $\mathcal{G}$ are merely
$\varepsilon$-almost universal, then the family of hash functions of the
form $s \to h(s) \oplus g(|s|)$ is also $\varepsilon$-almost universal.

2.5 Minimally Randomized Hashing

Many hashing algorithms (for instance, CityHash [35]) rely on a small
random seed. The 64-bit version of CityHash takes a 64-bit integer as a
seed. Thus, we effectively have a family of $2^{64}$ hash functions, one
for each possible seed value.

Given such a small family (i.e., given few random bits), we can prove
that it must have high collision probabilities. Indeed, consider the set
of all strings of $m$ 64-bit words. There are $2^{64m}$ such strings.

- Pick one hash function from the CityHash family. This function hashes
  every one of the $2^{64m}$ strings to one of $2^{64}$ hash values. By a
  pigeonhole argument, there must be at least one hash value where at
  least $2^{64m}/2^{64} = 2^{64(m-1)}$ strings collide.

- Pick another hash function. Out of the $2^{64(m-1)}$ strings colliding
  when using the first hash function, we must have $2^{64(m-2)}$ strings
  also colliding when using the second hash function.

We can repeat this process $m - 1$ times until we find $2^{64}$ strings
colliding when using any of these $m - 1$ hash functions. If an adversary
picks any two of our $2^{64}$ strings and we pick the hash function at
random in the whole family of $2^{64}$ hash functions, we get a collision
with a probability of at least $(m - 1)/2^{64}$. Thus, while we do not
have a strict bound on the collision probability of the CityHash family,
we know just from the small size of its seed that it must have a
relatively high collision probability for long strings. In contrast,
VHASH and our CLHASH use more than 64 random bits and have
correspondingly better collision bounds (see the table of collision
bounds).

3 VHASH 

The VHASH family [12, 25] was designed for 64-bit processors. By default,
it operates over 64-bit words. Among hash families offering good almost
universality for large data inputs, VHASH might be the fastest 64-bit
alternative on x64 processors, except for our own proposal (CLHASH).

VHASH is $\varepsilon$-almost $\Delta$-universal and builds on the
128-bit NH family:

$$\mathrm{NH}(s) = \sum_{i=1}^{l/2} \left( (s_{2i-1} + k_{2i-1} \bmod 2^{64}) \times (s_{2i} + k_{2i} \bmod 2^{64}) \right) \bmod 2^{128}. \qquad (1)$$

NH is $1/2^{64}$-almost $\Delta$-universal with hash values in
$[0, 2^{128})$. Although the NH family is defined only for inputs
containing an even number of components, we can extend it to include odd
numbers of components by padding the input with a zero component.
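
A direct transcription of the NH formula into Python may help fix the indices. This is a minimal sketch relying on Python's arbitrary-precision integers, not an optimized or official VHASH implementation; the function name `nh` and the zero-based indexing are choices made for this example.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def nh(s, k):
    """NH over a list of 64-bit words s with 64-bit keys k (len(k) >= len(s)).

    Words are paired as (s[0], s[1]), (s[2], s[3]), ... which corresponds
    to (s_1, s_2), (s_3, s_4), ... in the one-based notation of the text.
    """
    s = list(s)
    if len(s) % 2:                  # odd length: pad with a zero component
        s.append(0)
    if len(k) < len(s):
        raise ValueError("not enough key words")
    total = 0
    for i in range(0, len(s), 2):
        left = (s[i] + k[i]) & MASK64            # addition modulo 2^64
        right = (s[i + 1] + k[i + 1]) & MASK64   # addition modulo 2^64
        total = (total + left * right) & MASK128 # sum modulo 2^128
    return total

print(nh([1, 2], [3, 4]))   # (1 + 3) * (2 + 4) = 24
print(nh([1], [3, 4]))      # padded input: (1 + 3) * (0 + 4) = 16
```

On a 64-bit processor each product corresponds to one full 64-bit by 64-bit multiplication producing a 128-bit result, which is why NH is cheap per input word.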


We can summarize VHASH (see Algorithm 1) as follows:

- NH is used to generate a 128-bit hash value for each block of 16 words.
  The result is $1/2^{64}$-almost $\Delta$-universal on each block.

- These hash values are mapped to a value in $[0, 2^{126})$ by applying a
  modular reduction. These reduced hash values are then aggregated with a
  polynomial hash and finally reduced to a 64-bit value.

In total, the VHASH family is $1/2^{61}$-almost $\Delta$-universal over
$[0, 2^{64} - 257)$ for input strings of up to $2^{62}$ bits
[12, Theorem 1].

For long input strings, we expect that much of the running time of VHASH
is in the computation of NH on blocks of 16 words. On recent x64
processors, this computation involves 8 multiplications using the mulq
instruction (with two 64-bit inputs and two 64-bit outputs). For each
group of two consecutive words ($s_i$ and $s_{i+1}$), we also need two
64-bit additions. To sum all results, we need seven 128-bit additions,
each of which can be implemented using two 64-bit additions (addq and
adcq). All of these operations have a throughput of at least 1 per cycle
on Haswell processors. We can expect NH and, by extension, VHASH to be
fast.

VHASH uses only 16 64-bit random integers for the NH family. As in § 2.3,
we only need one specific NH function irrespective of the length of the
string. VHASH also uses a 128-bit random integer $k$ and two more 64-bit
random integers $k'_1$ and $k'_2$. Thus VHASH uses slightly less than 160
random bytes.


Algorithm 1 VHASH algorithm

Require: 16 randomly picked 64-bit integers $k_1, k_2, \ldots, k_{16}$
   defining a 128-bit NH hash function (see Equation 1) over inputs of
   length 16
Require: $k$, a randomly picked element of
   $\{w 2^{96} + x 2^{64} + y 2^{32} + z \mid \text{integers } w, x, y, z \in [0, 2^{29})\}$
Require: $k'_1, k'_2$, randomly picked integers in $[0, 2^{64} - 258]$

1: input: string $M$ made of $|M|$ bytes
2: Let $n$ be the number of 16-word blocks ($\lceil |M|/16 \rceil$).
3: Let $M_i$ be the substring of $M$ from index $i$ to $i + 16$, padding
   with zeros if needed.
4: Hash each $M_i$ using the NH function, labelling the resulting 128-bit
   results $a_i$ for $i = 1, \ldots, n$.
5: Hash the resulting $a_i$ with a polynomial hash function and store the
   value in a 127-bit hash value $p$:
   $p = k^n + a_1 k^{n-1} + \cdots + a_n + (|M| \bmod 1024) \times 2^{64} \bmod (2^{127} - 1)$.
6: Hash the 127-bit value $p$ down to a 64-bit value:
   $z = (p_1 + k'_1) \times (p_2 + k'_2) \bmod (2^{64} - 257)$, where
   $p_1 = p \div (2^{64} - 2^{32})$ and $p_2 = p \bmod (2^{64} - 2^{32})$.
7: return the 64-bit hash value $z$

3.1 Random Bits

Nguyen and Roscoe showed that at least $\log(m/\varepsilon)$ random bits
are required, where $m$ is the maximal string length in bits and
$\varepsilon$ is the collision bound. For VHASH, the string length is
limited to $2^{62}$ bits and the collision bound is
$\varepsilon = 1/2^{61}$. Therefore, for hash families offering the
bounds of VHASH, we have that
$\log(m/\varepsilon) = \log(2^{62} \times 2^{61}) = 123$ random bits are
required.

That is, 16 random bytes are theoretically required to achieve the same
collision bound as VHASH while many more are used (160 bytes). This
suggests that we might be able to find families using far fewer random
bits while maintaining the same good bounds. In fact, it is not difficult
to modify VHASH to reduce the use of random bits. It would suffice to
reduce the size of the blocks down from 16 words. We could show that it
cannot increase the bound on the collision probability by more than
$1/2^{64}$. However, reducing the size of the blocks has an adverse
effect on speed. With large blocks and long strings, most of the input is
processed with the NH function before the more expensive polynomial hash
function is used. Thus, there is a trade-off between speed and the number
of random bits, and VHASH is designed for speed on long strings.
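
The arithmetic behind the 123-bit lower bound and the 160-byte figure can be reproduced with a few lines of Python. The function name `min_random_bits` and the variable names are assumptions of this sketch; the numbers come from the text above.

```python
import math

def min_random_bits(max_length_bits, collision_bound):
    """Lower bound log2(m / eps) on the number of random bits."""
    return math.log2(max_length_bits) - math.log2(collision_bound)

# VHASH: strings of up to 2^62 bits, collision bound 1/2^61
vhash_bits = min_random_bits(2**62, 2.0**-61)
vhash_bytes = math.ceil(vhash_bits / 8)

# Random bytes used by VHASH as described in the text: sixteen 64-bit NH
# keys, one 128-bit key k, and two 64-bit keys k'_1 and k'_2.
vhash_actual_bytes = 16 * 8 + 16 + 2 * 8

print(vhash_bits, vhash_bytes, vhash_actual_bytes)
```

The gap between the theoretical minimum (16 bytes) and the 160 bytes reserved for keys is the slack the paragraph refers to; the text says "slightly less than 160" because not all bits of the keys are random.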

4 Finite Fields 

Our proposed hash family (CLHASH, presented in a later section) works
over a binary finite field. For completeness, we review field theory
briefly, introducing (classical) results as needed for our purposes.

The real numbers form what is called a field. A field is such that
addition and multiplication are associative, commutative and
distributive. We also have identity elements (0 for addition and 1 for
multiplication). Crucially, all non-zero elements $a$ have an inverse
$a^{-1}$ (which is defined by $a \times a^{-1} = a^{-1} \times a = 1$).

Finite fields (also called Galois fields) are fields containing a finite
number of elements. All finite fields have cardinality $p^n$ for some
prime $p$. Up to an algebraic isomorphism (i.e., a one-to-one map
preserving addition and multiplication), given a cardinality $p^n$, there
is only one field (henceforth $GF(p^n)$). And for any power of a prime,
there is a corresponding field.

4.1 Finite Fields of Prime Cardinality 

It is easy to create finite fields that have prime cardinality
($GF(p)$). Given $p$, an instance of $GF(p)$ is given by the set of
integers in $[0, p)$ with additions and multiplications completed by a
modular reduction:

- $a \times_{GF(p)} b = a \times b \bmod p$

- and $a +_{GF(p)} b = a + b \bmod p$.

The numbers 0 and 1 are the identity elements. Given an element $a$, its
additive inverse is $p - a$.

* In the present paper, $\log n$ means $\log_2 n$.

It is not difficult to check that all non-zero elements have a
multiplicative inverse. We review this classical result for completeness.
Given a non-zero element $a$ and two distinct $x, x'$, we have that
$ax \bmod p \ne ax' \bmod p$ because $p$ is prime. Hence, starting with a
fixed non-zero element $a$, we have that the set
$\{ax \bmod p \mid x \in [0, p)\}$ has cardinality $p$ and must contain
1; thus, $a$ must have a multiplicative inverse.
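
The counting argument can be checked by brute force for a small prime. The sketch below is illustrative only; the helper names are assumptions, and a real implementation would use the extended Euclidean algorithm rather than a linear search.

```python
def products_mod_p(a, p):
    """The set {a * x mod p | x in [0, p)}."""
    return {(a * x) % p for x in range(p)}

def inverse_mod_p(a, p):
    """Find the multiplicative inverse of a modulo a prime p by search."""
    for x in range(1, p):
        if (a * x) % p == 1:
            return x
    raise ValueError("no inverse: a must be non-zero modulo a prime p")

p = 13
for a in range(1, p):
    # multiplying by a permutes [0, p), so 1 is always reached
    assert len(products_mod_p(a, p)) == p
    assert (a * inverse_mod_p(a, p)) % p == 1
    # the additive inverse of a is p - a
    assert (a + (p - a)) % p == 0

# With a composite modulus the argument breaks down:
print(sorted(products_mod_p(4, 12)))   # only 3 distinct values
```

The composite example shows why primality matters: multiplication by 4 modulo 12 is not one-to-one, so 4 has no inverse there.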


4.2 Hash Families in a Field 

Within a field, we can easily construct hash families having strong
theoretical guarantees, as the next lemma illustrates.

Lemma 4 The family of functions of the form

$$h(x) = ax$$

in a finite field ($GF(p^n)$) is $\Delta$-universal, provided that the
key $a$ is picked from all values of the field.

As another example, consider hash functions of the form
$h(x_1, x_2, \ldots, x_m) = a^{m-1} x_1 + a^{m-2} x_2 + \cdots + x_m$
where $a$ is picked at random (a random input). Such polynomial hash
functions can be computed efficiently using Horner's rule: starting with
$r = x_1$, compute $r \leftarrow ar + x_i$ for $i = 2, \ldots, m$. Given
any two distinct inputs, $x_1, x_2, \ldots, x_m$ and
$x'_1, x'_2, \ldots, x'_m$, we have that
$h(x_1, \ldots, x_m) - h(x'_1, \ldots, x'_m)$ is a non-zero polynomial of
degree at most $m - 1$ in $a$. By the fundamental theorem of algebra, we
have that it is zero for at most $m - 1$ distinct values of $a$. Thus we
have that the probability of a collision is bounded by $(m - 1)/p^n$
where $p^n$ is the cardinality of the field. For example, VHASH uses
polynomial hashing with $p = 2^{127} - 1$ and $n = 1$.
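
Horner's rule for polynomial hashing takes only a few lines. The sketch below works in a prime field $GF(p)$ (the case $n = 1$); the function names and the small prime 101 used for the exhaustive count are assumptions of this example.

```python
def poly_hash(xs, a, p):
    """Polynomial hash a^(m-1) x_1 + ... + x_m in GF(p), via Horner's rule."""
    r = xs[0] % p
    for x in xs[1:]:
        r = (a * r + x) % p       # r <- a r + x_i
    return r

def colliding_keys(xs, ys, p):
    """All keys a in [0, p) for which the two inputs collide."""
    return [a for a in range(p) if poly_hash(xs, a, p) == poly_hash(ys, a, p)]

# 1 * 10^2 + 2 * 10 + 3 = 123, and 123 mod 101 = 22
print(poly_hash([1, 2, 3], 10, 101))

# Two distinct inputs of length m = 3 collide for at most m - 1 = 2 keys.
print(len(colliding_keys([1, 2, 3], [1, 5, 7], 101)))

# The same function with the modulus used by VHASH's polynomial hash.
print(poly_hash([1, 2, 3], 10, 2**127 - 1))
```

The count of colliding keys divided by $p$ is the collision probability, which is how the $(m - 1)/p^n$ bound arises.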

We can further reduce the collision probabilities if we use $m$ random
inputs $a_1, \ldots, a_m$ picked in the field to compute a multilinear
function: $h(x_1, \ldots, x_m) = a_1 x_1 + a_2 x_2 + \cdots + a_m x_m$.
We have $\Delta$-universality. Given two distinct inputs,
$x_1, \ldots, x_m$ and $x'_1, \ldots, x'_m$, we have that $x_i \ne x'_i$
for some $i$. Thus we have that
$h(x_1, \ldots, x_m) = c + h(x'_1, \ldots, x'_m)$ if and only if
$a_i = (x_i - x'_i)^{-1} \left( c + \sum_{j \ne i} a_j (x'_j - x_j) \right)$.

If $m$ is even, we can get the same bound on the collision probability
with half the number of multiplications:

$$h(x_1, x_2, \ldots, x_m) = (a_1 + x_1)(a_2 + x_2) + \cdots + (a_{m-1} + x_{m-1})(a_m + x_m).$$

The argument is similar. Consider that

$$(x_i + a_i)(a_{i+1} + x_{i+1}) - (x'_i + a_i)(a_{i+1} + x'_{i+1}) = a_{i+1}(x_i - x'_i) + a_i(x_{i+1} - x'_{i+1}) + x_{i+1} x_i - x'_i x'_{i+1}.$$

Take two distinct inputs, $x_1, x_2, \ldots, x_m$ and
$x'_1, x'_2, \ldots, x'_m$. As before, we have that $x_i \ne x'_i$ for
some $i$. Without loss of generality, assume that $i$ is odd; then we can
find a unique solution for $a_{i+1}$: to do this, start from
$h(x_1, \ldots, x_m) = c + h(x'_1, \ldots, x'_m)$ and solve for
$a_{i+1}(x_i - x'_i)$ in terms of an expression that does not depend on
$a_{i+1}$. Then use the fact that $x_i - x'_i$ has an inverse. This shows
that the collision probability is bounded by $1/p^n$ and we have
$\Delta$-universality.

Lemma 5 Given an even number $m$, the family of functions of the form

$$h(x_1, x_2, \ldots, x_m) = (a_1 + x_1)(a_2 + x_2) + (a_3 + x_3)(a_4 + x_4) + \cdots + (a_{m-1} + x_{m-1})(a_m + x_m)$$

in a finite field ($GF(p^n)$) is $\Delta$-universal, providing that the
keys $a_1, \ldots, a_m$ are picked from all values of the field. In
particular, the collision probability between two distinct inputs is
bounded by $1/p^n$.
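
Both the multilinear family and the family of Lemma 5 can be checked exhaustively in a tiny prime field. The sketch below is an illustration under that assumption ($GF(5)$ with $m = 2$); the function names are hypothetical and the brute-force enumeration is only practical for such toy sizes.

```python
from itertools import product

def multilinear_hash(xs, keys, p):
    """a_1 x_1 + a_2 x_2 + ... + a_m x_m in GF(p): m multiplications."""
    return sum(a * x for a, x in zip(keys, xs)) % p

def pair_multiply_hash(xs, keys, p):
    """(a_1 + x_1)(a_2 + x_2) + ... in GF(p): m/2 multiplications."""
    total = 0
    for i in range(0, len(xs), 2):
        total += (keys[i] + xs[i]) * (keys[i + 1] + xs[i + 1])
    return total % p

def worst_key_count(hash_fn, p, m=2):
    """Max number of keys with h(x) - h(x') = c, over x != x' and all c."""
    inputs = list(product(range(p), repeat=m))
    all_keys = list(product(range(p), repeat=m))
    worst = 0
    for x in inputs:
        for y in inputs:
            if x == y:
                continue
            counts = [0] * p
            for keys in all_keys:
                diff = (hash_fn(x, keys, p) - hash_fn(y, keys, p)) % p
                counts[diff] += 1
            worst = max(worst, max(counts))
    return worst

p = 5
# p^2 keys in total; Delta-universality means at most p^2 / p = p keys
# satisfy h(x) = c + h(x') for any fixed x != x' and c.
print(worst_key_count(multilinear_hash, p), worst_key_count(pair_multiply_hash, p))
```

Both families reach the same bound $1/p$, but the second performs one multiplication per pair of input components instead of one per component.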


4.3 Binary Finite Fields 


Finite fields having prime cardinality are simple (see § 4.1), but we
would prefer to work with fields having a power-of-two cardinality (also
called binary fields) to match common computer architectures.
Specifically, we are interested in $GF(2^{64})$ because our desktop
processors typically have 64-bit architectures.

We can implement such a field over the integers in $[0, 2^L)$ by using
the following two operations. Addition is defined as the bitwise XOR
($\oplus$) operation, which is fast on most computers:

$$a +_{GF(2^L)} b = a \oplus b.$$

The number 0 is the additive identity element
($a \oplus 0 = 0 \oplus a = a$), and every number is its own additive
inverse: $a \oplus a = 0$. Note that because binary finite fields use XOR
as an addition, $\Delta$-universality and XOR-universality are
effectively equivalent for our purposes in binary finite fields.

Multiplication is defined as a carry-less multiplication followed by a
reduction. We use the convention that $a_i$ is the $i$th least
significant bit of integer $a$ and $a_i = 0$ if $i$ is larger than the
most significant bit of $a$. The $i$th bit of the carry-less
multiplication $a \star b$ of $a$ and $b$ is given by

$$(a \star b)_i = \bigoplus_{k=0}^{i} a_{i-k} b_k \qquad (2)$$

where $a_{i-k} b_k$ is just a regular multiplication between two integers
in $\{0, 1\}$ and $\bigoplus_{k=0}^{i}$ is the bitwise XOR of a range of
values. The carry-less product of two $L$-bit integers is a $2L$-bit
integer. We can check that the integers with $\oplus$ as addition and
$\star$ as multiplication form a ring: addition and multiplication are
associative, commutative and distributive, and there is an additive
identity element. In this instance, the number 1 is a multiplicative
identity element ($a \star 1 = 1 \star a = a$). Except for the number 1,
no number has a multiplicative inverse in this ring.
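
To make Equation 2 concrete, here is a minimal software sketch of the carry-less product in Python, together with the bit-level definition. It is a plain shift-and-XOR loop, not the pclmulqdq instruction; the function names are assumptions of this example.

```python
def clmul(a, b):
    """Carry-less product of non-negative integers a and b."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift    # XOR instead of add: no carries
        b >>= 1
        shift += 1
    return result

def clmul_bit(a, b, i):
    """Bit i of the carry-less product, computed directly from Equation 2."""
    bit = 0
    for k in range(i + 1):
        bit ^= ((a >> (i - k)) & 1) & ((b >> k) & 1)
    return bit

print(clmul(3, 3))            # (x + 1)^2 = x^2 + 1, i.e. 5 (regular 3 * 3 = 9)
print(clmul(0b1011, 0b110))   # 0b111010 = 58

# The two definitions agree bit by bit.
a, b = 0b1011, 0b110
assert all(((clmul(a, b) >> i) & 1) == clmul_bit(a, b, i) for i in range(8))
```

Viewing each integer as a polynomial with coefficients in $\{0, 1\}$, `clmul` is ordinary polynomial multiplication with coefficients reduced modulo 2.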

Given the ring determined by $\oplus$ and $\star$, we can derive a
corresponding finite field. However, just as with finite fields of prime
cardinality, we need some kind of modular reduction and a concept
equivalent to that of prime number.

Let us define degree($x$) to be the position of the most significant
non-zero bit of $x$, starting at 0 (e.g., degree(1) = 0, degree(2) = 1,
degree($2^j$) = $j$). For example, we have degree($x$) $\le 127$ for any
128-bit integer $x$. Given any two non-zero integers $a, b$, we have that
degree($a \star b$) = degree($a$) + degree($b$) as a straightforward
consequence of Equation 2. Similarly, we have that

degree($a \oplus b$) $\le$ max(degree($a$), degree($b$)).

Not unlike regular multiplication, given integers $a, b$ with $b \ne 0$,
there are unique integers $\alpha, \beta$ (henceforth the quotient and
the remainder) such that

$$a = \alpha \star b \oplus \beta \qquad (3)$$

where degree($\beta$) < degree($b$).

The uniqueness of the quotient and the remainder is easily shown. Suppose
that there is another pair of values $\alpha', \beta'$ with the same
property. Then
$\alpha' \star b \oplus \beta' = \alpha \star b \oplus \beta$, which
implies that $(\alpha' \oplus \alpha) \star b = \beta' \oplus \beta$.
However, since degree($\beta' \oplus \beta$) < degree($b$), we must have
that $\alpha = \alpha'$. From this it follows that $\beta = \beta'$, thus
establishing uniqueness.

We define $\div$ and mod operators as giving respectively the quotient
($a \div b = \alpha$) and remainder ($a \bmod b = \beta$) so that the
equation

$$a = a \div b \star b \oplus a \bmod b \qquad (4)$$

is an identity equivalent to Equation 3. (To avoid unnecessary
parentheses, we use the following operator precedence convention:
$\star$, mod and $\div$ are executed first, from left to right, followed
by $\oplus$.)

In the general case, we can compute $a \div b$ and $a \bmod b$ using a
straightforward variation on the Euclidean division algorithm (see
Algorithm 2), which proves the existence of the remainder and quotient.
Checking the correctness of the algorithm is straightforward. We start
initially with values $\alpha$ and $\beta$ such that
$a = \alpha \star b \oplus \beta$. By inspection, this equality is
preserved throughout the algorithm. Meanwhile, the algorithm only
terminates when the degree of $\beta$ is less than that of $b$, as
required. And the algorithm must terminate, since the degree of $\beta$
is reduced by at least one each time it is updated (for a maximum of
degree($a$) $-$ degree($b$) + 1 steps).

* The general construction of a finite field of cardinality $p^n$ for
$n > 1$ is commonly explained in terms of polynomials with coefficients
from $GF(p)$. To avoid unnecessary abstraction, we present finite fields
of cardinality $2^L$ using regular $L$-bit integers. Interested readers
can see Mullen and Panario [30] for the alternative development.


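Algorithm 2 Carry-less division algorithm

1: input: Two integers $a$ and $b$, where $b$ must be non-zero
2: output: Carry-less quotient and remainder: $\alpha = a \div b$ and
   $\beta = a \bmod b$, such that $a = \alpha \star b \oplus \beta$ and
   degree($\beta$) < degree($b$)
3: Let $\alpha \leftarrow 0$ and $\beta \leftarrow a$
4: while degree($\beta$) $\ge$ degree($b$) do
5:    let $x \leftarrow 2^{\mathrm{degree}(\beta) - \mathrm{degree}(b)}$
6:    $\alpha \leftarrow x \oplus \alpha$, $\beta \leftarrow x \star b \oplus \beta$
7: end while
8: return $\alpha$ and $\beta$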
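
Algorithm 2 translates almost line by line into Python. This is a minimal sketch: the function names are assumptions, and the convention degree(0) = -1 is a choice made here so that the loop also stops when the remainder becomes zero (the text does not define the degree of 0).

```python
def clmul(a, b):
    """Carry-less product of non-negative integers a and b."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result

def degree(x):
    """Position of the most significant non-zero bit; -1 for x = 0 (assumed)."""
    return x.bit_length() - 1

def cl_divmod(a, b):
    """Carry-less quotient and remainder of a by b, following Algorithm 2."""
    if b == 0:
        raise ZeroDivisionError("b must be non-zero")
    alpha, beta = 0, a
    while degree(beta) >= degree(b):
        x = 1 << (degree(beta) - degree(b))
        alpha ^= x                  # alpha <- x XOR alpha
        beta ^= clmul(x, b)         # beta <- x * b XOR beta (clears the top bit)
    return alpha, beta

print(cl_divmod(5, 3))   # x^2 + 1 = (x + 1)(x + 1): quotient 3, remainder 0
print(cl_divmod(7, 2))   # x^2 + x + 1 = (x + 1) x + 1: quotient 3, remainder 1
```

Since `x` is a power of two, `clmul(x, b)` is just `b` shifted left, so each iteration costs one shift and two XORs.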


Given $a = \alpha \star b \oplus \beta$ and
$a' = \alpha' \star b \oplus \beta'$, we have that
$a \oplus a' = (\alpha \oplus \alpha') \star b \oplus (\beta \oplus \beta')$.
Thus, it can be checked that divisions and modular reductions are
distributive:

$$(a \oplus b) \bmod p = (a \bmod p) \oplus (b \bmod p), \qquad (5)$$

$$(a \oplus b) \div p = (a \div p) \oplus (b \div p). \qquad (6)$$

Thus, we have $(a \oplus b) \bmod p = 0 \Rightarrow a \bmod p = b \bmod p$.
Moreover, by inspection, we have that degree($a \bmod b$) < degree($b$)
and degree($a \div b$) = degree($a$) $-$ degree($b$).

The carry-less multiplication by a power of two is equivalent to regular
multiplication. For this reason, a modular reduction by a power of two
(e.g., $a \bmod 2^{64}$) is just the regular integer modular reduction.
Idem for division.

There are non-zero integers $a$ such that there is no integer $b$ other
than 1 and $a$ itself such that $a \bmod b = 0$; effectively $a$ is a
prime number under the carry-less multiplication interpretation. These
"prime integers" are more commonly known as irreducible polynomials in
the ring of polynomials $GF(2)[x]$, so we call them irreducible instead
of prime. Let us pick such an irreducible integer $p$ (arbitrarily) such
that the degree of $p$ is 64. One such integer is
$2^{64} + 2^4 + 2^3 + 2 + 1$. Then we can finally define the
multiplication operation in $GF(2^{64})$:

$$a \times_{GF(2^{64})} b = (a \star b) \bmod p.$$

Coupled with the addition $+_{GF(2^{64})}$ that is just a bitwise XOR, we
have an implementation of the field $GF(2^{64})$ over integers in
$[0, 2^{64})$.
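
Putting the pieces together, the field $GF(2^{64})$ can be sketched in Python as follows. This is a slow reference sketch using generic shift-and-XOR reduction, not the optimized reduction of § 4.4; the function names are assumptions, and the inverse is computed as $a^{2^{64}-2}$, a standard fact about finite fields that is not discussed in the text.

```python
P = (1 << 64) | (1 << 4) | (1 << 3) | (1 << 1) | 1   # 2^64 + 2^4 + 2^3 + 2 + 1

def clmul(a, b):
    """Carry-less product of non-negative integers a and b."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result

def clmod(a, p):
    """Carry-less remainder of a modulo p."""
    dp = p.bit_length() - 1
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)   # cancel the leading bit of a
    return a

def gf64_add(a, b):
    """Addition in GF(2^64): bitwise XOR."""
    return a ^ b

def gf64_mul(a, b):
    """Multiplication in GF(2^64): carry-less product reduced modulo P."""
    return clmod(clmul(a, b), P)

def gf64_inverse(a):
    """Inverse of a non-zero a, computed as a^(2^64 - 2) by square-and-multiply."""
    result, base, e = 1, a, (1 << 64) - 2
    while e:
        if e & 1:
            result = gf64_mul(result, base)
        base = gf64_mul(base, base)
        e >>= 1
    return result

print(gf64_mul(2, 1 << 63))   # 2^64 mod p = 2^4 + 2^3 + 2 + 1 = 27
a = 0x0123456789ABCDEF
print(gf64_mul(a, gf64_inverse(a)))   # 1
```

The last line is only meaningful because $p$ is irreducible: with a reducible modulus, some non-zero elements would have no inverse.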

We call the index of the second most significant non-zero bit the
subdegree. We chose an irreducible $p$ of degree 64 having minimal
subdegree (4). We use the fact that this subdegree is small to accelerate
the computation of the modular reduction in the next section.


4.4 Efficient Reduction in $GF(2^{64})$

AMD and Intel have introduced a fast instruction that can compute a
carry-less multiplication between two 64-bit numbers, and it generates a
128-bit integer. To get the multiplication in $GF(2^{64})$, we must still
reduce this 128-bit integer to a 64-bit integer. Since there is no
equivalent fast modular instruction, we need to derive an efficient
algorithm.

There are efficient reduction algorithms used in cryptography (e.g., from
256-bit to 128-bit integers), but they do not suit our purposes: we have
to reduce to 64-bit integers. Inspired by the classical Barrett
reduction, Knezevic et al. proposed a generic modular reduction algorithm
in $GF(2^n)$, using no more than two multiplications [22]. We put this to
good use in previous work. However, we can do the same reduction using a
single multiplication. According to our tests, the reduction technique
presented next is 30% faster than an optimized implementation based on
Knezevic et al.'s algorithm.

Let us write $p = 2^{64} \oplus r$. In our case, we have
$r = 2^4 + 2^3 + 2 + 1 = 27$ and degree($r$) = 4. We are interested in
applying a modular reduction by $p$ to the result of the multiplication
of two integers in $[0, 2^{64})$, and the result of such a multiplication
is an integer $x$ such that degree($x$) $\le 127$. We want to compute
$x \bmod p$ quickly. We begin with the following lemma.


Lemma 6 Consider any integer $p$ of degree 64, written
$p = 2^{64} \oplus r$. We define the operations mod and $\div$ as the
counterparts of the carry-less multiplication $\star$ as in § 4.3. Given
any $x$, we have that

$$x \bmod p = ((z \div 2^{64}) \star 2^{64}) \bmod p \oplus z \bmod 2^{64} \oplus x \bmod 2^{64}$$

where $z = (x \div 2^{64}) \star r$.


Proof We have that $x = (x \div 2^{64}) \star 2^{64} \oplus x \bmod 2^{64}$
for any $x$ by definition. Applying the modular reduction on both sides
of the equality, we get

\begin{align*}
x \bmod p &= (x \div 2^{64}) \star 2^{64} \bmod p \oplus x \bmod 2^{64} \bmod p \\
&= (x \div 2^{64}) \star 2^{64} \bmod p \oplus x \bmod 2^{64} && \text{by Fact 1} \\
&= (x \div 2^{64}) \star r \bmod p \oplus x \bmod 2^{64} && \text{by Fact 2} \\
&= z \bmod p \oplus x \bmod 2^{64} && \text{by $z$'s def.} \\
&= ((z \div 2^{64}) \star 2^{64}) \bmod p \oplus z \bmod 2^{64} \oplus x \bmod 2^{64} && \text{by Fact 3}
\end{align*}

where Facts 1, 2 and 3 are as follows:

- (Fact 1) For any $x$, we have that
  $(x \bmod 2^{64}) \bmod p = x \bmod 2^{64}$.

- (Fact 2) For any integer $z$, we have that
  $(2^{64} \oplus r) \star z \bmod p = p \star z \bmod p = 0$ and
  therefore

  $$2^{64} \star z \bmod p = r \star z \bmod p$$

  by the distributivity of the modular reduction (Equation 5).

- (Fact 3) Recall that by definition
  $z = (z \div 2^{64}) \star 2^{64} \oplus z \bmod 2^{64}$. We can
  substitute this equation in the equation from Fact 1. For any $z$ and
  any non-zero $p$, we have that

  \begin{align*}
  z \bmod p &= ((z \div 2^{64}) \star 2^{64} \oplus z \bmod 2^{64}) \bmod p \\
  &= ((z \div 2^{64}) \star 2^{64}) \bmod p \oplus z \bmod 2^{64}
  \end{align*}

  by the distributivity of the modular reduction (see Equation 5).

Hence the result is shown.

* That this $p$ is irreducible with minimal subdegree can be readily
verified using a mathematical software package such as Sage or Maple.

Lemma 6 provides a formula to compute $x \bmod p$. Computing $z = (x \div 2^{64}) \star r$ involves a carry-less multiplication, which can be done efficiently on recent Intel and AMD processors. The computation of $z \bmod 2^{64}$ and $x \bmod 2^{64}$ is trivial. It remains to compute $((z \div 2^{64}) \star 2^{64}) \bmod p$. At first glance, we still have a modular reduction. However, we can easily memoize the result of $((z \div 2^{64}) \star 2^{64}) \bmod p$. The next lemma shows that there are only 16 distinct values to memoize (this follows from the low subdegree of $p$).

Lemma 7 Given that $x$ has degree less than 128, there are only 16 possible values of $(z \div 2^{64}) \star 2^{64} \bmod p$, where $z = (x \div 2^{64}) \star r$ and $r = 2^4 + 2^3 + 2 + 1$.

Proof Indeed, we have that

$$\mathrm{degree}(z) = \mathrm{degree}(x) - 64 + \mathrm{degree}(r).$$

Because $\mathrm{degree}(x) \le 127$, we have that $\mathrm{degree}(z) \le 127 - 64 + 4 = 67$. Therefore, we have $\mathrm{degree}(z \div 2^{64}) \le 3$. Hence, we can represent $z \div 2^{64}$ using 4 bits: there are only 16 4-bit integers.


Thus, in the worst possible case, we would need to memoize 16 distinct 128-bit integers to represent $((z \div 2^{64}) \star 2^{64}) \bmod p$. However, observe that the degree of $z \div 2^{64}$ is bounded by $\mathrm{degree}(x) - 64 + 4 - 64 \le 127 - 128 + 4 = 3$ since $\mathrm{degree}(x) \le 127$. By using Lemma 8, we show that each integer $((z \div 2^{64}) \star 2^{64}) \bmod p$ has degree bounded by 7 so that it can be represented using no more than 8 bits: setting $L = 64$ and $w = z \div 2^{64}$, we have $\mathrm{degree}(w) \le 3$, $\mathrm{degree}(r) = 4$ and $\mathrm{degree}(w) + \mathrm{degree}(r) \le 7$.

Effectively, the lemma says that if you take a value $w$ of small degree, multiply it by $2^L$, and then compute the modular reduction of the result by a value $p$ that is almost $2^L$ (except for a value $r$ of small degree), then the result has small degree: it is bounded by the sum of the degrees of $w$ and $r$.

Lemma 8 Consider $p = 2^L \oplus r$, with $r$ of degree less than $L$. For any $w$, the degree of $w \star 2^L \bmod p$ is bounded by $\mathrm{degree}(w) + \mathrm{degree}(r)$.

Moreover, when $\mathrm{degree}(w) + \mathrm{degree}(r) < L$ then the degree of $w \star 2^L \bmod p$ is exactly $\mathrm{degree}(w) + \mathrm{degree}(r)$.

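Proof The result is trivial if $\mathrm{degree}(w) + \mathrm{degree}(r) \ge L$, since the degree of $w \star 2^L \bmod p$ must be smaller than the degree of $p$.

So let us assume that $\mathrm{degree}(w) + \mathrm{degree}(r) < L$. By the definition of the modular reduction (given in an earlier equation), we have

$$w \star 2^L = (w \star 2^L \div p) \star p \;\oplus\; w \star 2^L \bmod p.$$

Let $w' = w \star 2^L \div p$, then

$$
\begin{aligned}
w \star 2^L &= w' \star p \;\oplus\; w \star 2^L \bmod p \\
&= w' \star r \;\oplus\; w' \star 2^L \;\oplus\; w \star 2^L \bmod p.
\end{aligned}
$$

The first $L$ bits of $w \star 2^L$ and $w' \star 2^L$ are zero. Therefore, we have

$$(w' \star r) \bmod 2^L = (w \star 2^L \bmod p) \bmod 2^L.$$

Moreover, the degree of $w'$ is the same as the degree of $w$: $\mathrm{degree}(w') = \mathrm{degree}(w) + \mathrm{degree}(2^L) - \mathrm{degree}(p) = \mathrm{degree}(w) + L - L = \mathrm{degree}(w)$. Hence, we have $\mathrm{degree}(w' \star r) = \mathrm{degree}(w) + \mathrm{degree}(r) < L$. And, of course, $\mathrm{degree}(w \star 2^L \bmod p) < L$. Thus, we have that

$$w' \star r = w \star 2^L \bmod p.$$

Hence, it follows that $\mathrm{degree}(w \star 2^L \bmod p) = \mathrm{degree}(w' \star r) = \mathrm{degree}(w) + \mathrm{degree}(r)$.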
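As a small worked instance of Lemma 8 (the choice $w = 2^3$ is ours, picked only for illustration), take $L = 64$ and $r = 27$, so that $p = 2^{64} \oplus 27$:

```latex
\begin{aligned}
w &= 2^3, \qquad \mathrm{degree}(w) + \mathrm{degree}(r) = 3 + 4 = 7 < 64, \\
w' &= w \star 2^{64} \div p = 2^{67} \div (2^{64} \oplus 27) = 2^3, \\
w \star 2^{64} \bmod p &= w' \star r = 2^3 \star (2^4 \oplus 2^3 \oplus 2 \oplus 1)
  = 2^7 \oplus 2^6 \oplus 2^4 \oplus 2^3 = 216, \\
\mathrm{degree}(216) &= 7 = \mathrm{degree}(w) + \mathrm{degree}(r).
\end{aligned}
```

The value 216 fits in a single byte, which is what makes the byte-wise memoization possible.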

Thus the memoization requires access to only 16 8-bit values. We enumerate the values in question ($w \star 2^{64} \bmod p$ for $w = 0, 1, \ldots, 15$) in Table 3. It is convenient that $16 \times 8 = 128$ bits: the entire table fits in a 128-bit word. It means that if the list of 8-bit values is stored using one byte each, the SSSE3 instruction pshufb can be used for fast look-up. (See Algorithm 3.)
Table 3: Values of $w \star 2^{64} \bmod p$ for $w = 0, 1, \ldots, 15$ given $p = 2^{64} + 2^4 + 2^3 + 2 + 1$.

    w                     w ⋆ 2^64 mod p
    decimal   binary      decimal   binary
    0         0000        0         00000000
    1         0001        27        00011011
    2         0010        54        00110110
    3         0011        45        00101101
    4         0100        108       01101100
    5         0101        119       01110111
    6         0110        90        01011010
    7         0111        65        01000001
    8         1000        216       11011000
    9         1001        195       11000011
    10        1010        238       11101110
    11        1011        245       11110101
    12        1100        180       10110100
    13        1101        175       10101111
    14        1110        130       10000010
    15        1111        153       10011001
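The entries of Table 3 can be regenerated rather than typed in by hand. The following is a minimal sketch in portable C, assuming only that each entry equals the carry-less product of $w$ with $r = 27$ (which Lemma 8 guarantees for 4-bit $w$). The function names `table3_entry` and `build_table3` are ours and do not come from the authors' code.

```c
#include <assert.h>
#include <stdint.h>

/* Carry-less product of a 4-bit value w with r = 27 (binary 11011).
   The result has degree at most 3 + 4 = 7, so it fits in one byte. */
uint8_t table3_entry(unsigned w)
{
    assert(w < 16);
    const unsigned r = 27;
    unsigned acc = 0;
    for (unsigned i = 0; i < 4; i++) {
        if ((w >> i) & 1u) {
            acc ^= r << i;   /* XOR instead of addition: no carries */
        }
    }
    return (uint8_t)acc;
}

/* Fill a 16-byte table with one byte per entry, the layout a
   byte-shuffle look-up such as pshufb would index. */
void build_table3(uint8_t table[16])
{
    for (unsigned w = 0; w < 16; w++) {
        table[w] = table3_entry(w);
    }
}
```

Generating the table this way makes it easy to adapt the sketch to another low-degree $r$, provided $\mathrm{degree}(r) \le 4$ so that each entry still fits in a byte.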


Algorithm 3 Carry-less division algorithm

1: input: A 128-bit integer $a$
2: output: Carry-less modular reduction $a \bmod p$ where $p = 2^{64} + 27$
3: $z \leftarrow (a \div 2^{64}) \star r$
4: $w \leftarrow z \div 2^{64}$
5: Look up $w \star 2^{64} \bmod p$ in Table 3, store result in $y$
6: return $a \bmod 2^{64} \oplus z \bmod 2^{64} \oplus y$

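Corresponding C implementation using x64 intrinsics:

#include <stdint.h>
#include <immintrin.h> /* SSE2, SSSE3 and CLMUL intrinsics */

uint64_t modulo(__m128i a) {
    /* r = 27 in the low 64-bit lane */
    __m128i r = _mm_cvtsi64_si128(27);
    /* z = (a div 2^64) * r: 0x01 selects the high lane of a and the low lane of r */
    __m128i z = _mm_clmulepi64_si128(a, r, 0x01);
    /* Table 3, one byte per entry */
    __m128i table = _mm_setr_epi8(0, 27, 54, 45, 108, 119, 90, 65,
                                  216, 195, 238, 245, 180, 175, 130, 153);
    /* y = table[z div 2^64]: shift z right by 8 bytes, then look up with pshufb */
    __m128i y = _mm_shuffle_epi8(table, _mm_srli_si128(z, 8));
    /* low lane holds (a mod 2^64) xor (z mod 2^64) */
    __m128i temp1 = _mm_xor_si128(z, a);
    return _mm_cvtsi128_si64(_mm_xor_si128(temp1, y));
}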
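To check the single-multiplication reduction on a machine without CLMUL, one can compare it against a bit-by-bit reduction. The sketch below is our own portable restatement of Algorithm 3, not the authors' code: the type `u128`, the software multiplier `clmul64` and the reference routine `reduce_bitwise` are hypothetical helpers, and the table look-up is replaced by a second small carry-less product.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit value stored as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } u128;

/* Software carry-less multiplication of two 64-bit words. */
u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0) r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* Reduction of x = hi * 2^64 xor lo modulo p = 2^64 + 27 (Lemma 6). */
uint64_t reduce_lemma6(uint64_t hi, uint64_t lo)
{
    u128 z = clmul64(hi, 27);        /* z = (x div 2^64) * r            */
    uint64_t w = z.hi;               /* w = z div 2^64                  */
    assert(w < 16);                  /* degree(w) <= 3 (Lemma 7)        */
    uint64_t y = clmul64(w, 27).lo;  /* Table 3 entry, w * r (Lemma 8)  */
    return lo ^ z.lo ^ y;
}

/* Reference: cancel the high bits one at a time by XORing in p * 2^i. */
uint64_t reduce_bitwise(uint64_t hi, uint64_t lo)
{
    for (int i = 63; i >= 0; i--) {
        if ((hi >> i) & 1) {
            hi ^= (uint64_t)1 << i;                     /* clears bit 64 + i   */
            lo ^= (uint64_t)27 << i;
            if (i > 0) hi ^= (uint64_t)27 >> (64 - i);  /* overflow of 27 << i */
        }
    }
    return lo;
}
```

The reference loop works from the highest bit down because XORing in $p \star 2^i$ can only set bits below position $64 + i$.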


5 CLHASH 

The CLHASH family resembles the VHASH family, except that members work in a binary finite field. The VHASH family has the 128-bit NH family (see the NH equation), but we instead use the 128-bit CLNH family:

$$\mathrm{CLNH}(s) = \bigoplus_{i=1}^{l/2} \left( (s_{2i-1} \oplus k_{2i-1}) \star (s_{2i} \oplus k_{2i}) \right)$$

where the $s_i$'s and $k_i$'s are 64-bit integers and $l$ is the length of the string $s$. The formula assumes that $l$ is even: we pad odd-length inputs with a single zero word. When an input string $M$ is made of $|M|$ bytes, we can consider it as a string of 64-bit words $s$ by padding it with up to 7 zero bytes so that $|M|$ is divisible by 8.

On x64 processors with the CLMUL instruction set, a single term $((s_{2i-1} \oplus k_{2i-1}) \star (s_{2i} \oplus k_{2i}))$ can be computed using one 128-bit XOR instruction (pxor in SSE2) and one carry-less multiplication using the pclmulqdq instruction:

- load $(k_{2i-1}, k_{2i})$ in a 128-bit word,

- load $(s_{2i-1}, s_{2i})$ in another 128-bit word,

- compute $(k_{2i-1}, k_{2i}) \oplus (s_{2i-1}, s_{2i}) = (k_{2i-1} \oplus s_{2i-1}, k_{2i} \oplus s_{2i})$ using one pxor instruction,

- compute $(k_{2i-1} \oplus s_{2i-1}) \star (k_{2i} \oplus s_{2i})$ using one pclmulqdq instruction (result is a 128-bit word).

An additional pxor instruction is required per pair of words to compute CLNH, since we need to aggregate the results.
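The steps above can be sketched in portable C, with a software carry-less multiplication standing in for pclmulqdq. This is an illustrative sketch under our own naming (`u128`, `clmul64`, `clnh`), not the authors' intrinsics-based implementation, and it is far slower than the hardware instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 128-bit value stored as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } u128;

/* Software stand-in for pclmulqdq: 64 x 64 -> 128-bit carry-less product. */
u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0) r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* CLNH over len 64-bit words s[0..len-1] with keys k[0..len-1].
   len must be even: pad odd-length inputs with one zero word first. */
u128 clnh(const uint64_t *s, const uint64_t *k, size_t len)
{
    assert(len % 2 == 0);
    u128 acc = {0, 0};
    for (size_t i = 0; i + 1 < len; i += 2) {
        /* XOR the keys into the words, then multiply the pair */
        u128 t = clmul64(s[i] ^ k[i], s[i + 1] ^ k[i + 1]);
        /* aggregate: the extra XOR per pair of words */
        acc.lo ^= t.lo;
        acc.hi ^= t.hi;
    }
    return acc;
}
```

Note that the arrays are indexed from zero here, so the pair $(s_{2i-1}, s_{2i})$ of the text corresponds to `s[i]` and `s[i + 1]` with `i` even.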

We have that the family $s \to \mathrm{CLNH}(s) \bmod p$ for some irreducible $p$ of degree 64 is XOR universal over same-length strings. Indeed, $\Delta$-universality in the field $GF(2^{64})$ follows from Lemma 1. However, recall that $\Delta$-universality in a binary finite field (with operations $\star$ and $\oplus$ for multiplication and addition) is the same as XOR universality: addition is the XOR operation ($\oplus$). It follows that the CLNH family must be $1/2^{64}$-almost universal for same-length strings.

Given an arbitrarily long string of 64-bit words, we can divide it up into blocks of 128 words (padding the last block with zeros if needed). Each block can be hashed using CLNH and the result is $1/2^{64}$-almost universal by an earlier lemma. If there is a single block, we can compute $\mathrm{CLNH}(s) \bmod p$ to get an XOR universal hash value. Otherwise, the resulting 128-bit hash values $a_1, a_2, \ldots, a_n$ can then be hashed once more. For this we use a polynomial hash function, $k^{n-1} a_1 + k^{n-2} a_2 + \cdots + a_n$, for some random input $k$ in some finite field. We choose the field $GF(2^{127})$ and use the irreducible $p = 2^{127} + 2 + 1$. We compute such a polynomial hash function by using Horner's rule: starting with $r = a_1$, compute $r \leftarrow k \star r \oplus a_i$ for $i = 2, 3, \ldots, n$. For this purpose, we need carry-less multiplications between pairs of 128-bit integers: we can achieve the desired result with 4 pclmulqdq instructions, in addition to some shift and XOR operations. The multiplication generates a 256-bit integer $x$ that must be reduced. However, it is not necessary to reduce it to a 127-bit integer (which would be the result if we applied a modular reduction by $2^{127} + 2 + 1$). It is enough to reduce it to a 128-bit integer $x'$ such that $x' \bmod (2^{127} + 2 + 1) = x \bmod (2^{127} + 2 + 1)$. We get the desired result by setting $x'$ equal to the lazy modular reduction $x \operatorname{mod_{lazy}} (2^{127} + 2 + 1)$ defined as

$$
\begin{aligned}
x \operatorname{mod_{lazy}} (2^{127} + 2 + 1) &= x \bmod ((2^{127} + 2 + 1) \star 2) \\
&= x \bmod (2^{128} + 4 + 2) \\
&= (x \bmod 2^{128}) \oplus ((x \div 2^{128}) \star 4 \oplus (x \div 2^{128}) \star 2).
\end{aligned}
$$

It is computationally convenient to assume that $\mathrm{degree}(x) < 256 - 2$ so that $\mathrm{degree}((x \div 2^{128}) \star 4) < 128$. We can achieve this degree bound by picking the polynomial coefficient $k$ to have $\mathrm{degree}(k) < 128 - 2$. The resulting polynomial hash family is $(n-1)/2^{126}$-almost universal for strings having the same length, where $n$ is the number of 128-word blocks ($\lceil |M|/1024 \rceil$ where $|M|$ is the string length in bytes), whether we use the actual modular or the lazy modular reduction.
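The lazy reduction needs only shifts and XORs, since carry-less multiplication by 4 and by 2 are left shifts by two and one bit. The sketch below illustrates this in portable C. The representation of the 256-bit value as four 64-bit words and the names `u128` and `lazy_mod` are our assumptions, not the authors' code.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit value stored as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } u128;

/* Lazy reduction of the 256-bit value x = (x3, x2, x1, x0), with x0 the
   least significant word, down to 128 bits:
   (x mod 2^128) xor (x div 2^128) * 4 xor (x div 2^128) * 2. */
u128 lazy_mod(uint64_t x3, uint64_t x2, uint64_t x1, uint64_t x0)
{
    /* Requires degree(x) < 254 so the shifted high half fits in 128 bits. */
    assert((x3 >> 62) == 0);
    u128 r;
    r.lo = x0 ^ (x2 << 2) ^ (x2 << 1);
    r.hi = x1 ^ ((x3 << 2) | (x2 >> 62))
              ^ ((x3 << 1) | (x2 >> 63));
    return r;
}
```

For instance, $x = 2^{128}$ reduces to $4 \oplus 2 = 6$, in agreement with $2^{128} = 4 \oplus 2$ modulo $2^{128} + 4 + 2$.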

It remains to reduce the final output $O$ (stored in a 128-bit word) to a 64-bit hash value. For this purpose, we can use $s \to \mathrm{CLNH}(s) \bmod p$ with $p = 2^{64} + 27$ (see § 4.4), and where $k''$ is a random 64-bit integer. We treat $O$ as a string containing two 64-bit words. Once more, the reduction is XOR universal by an application of an earlier lemma. Thus, we have the composition of three hash functions with collision probabilities $1/2^{64}$, $(n-1)/2^{126}$ and $1/2^{64}$. It is reasonable to bound the string length by $2^{64}$ bytes: $n \le 2^{64}/1024 = 2^{54}$. We have that $2/2^{64} + (2^{54} - 1)/2^{126} < 2.004/2^{64}$. Thus, for same-length strings, we have $2.004/2^{64}$-almost XOR universality.
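The constant 2.004 can be checked by expressing the middle term over the common denominator $2^{64}$:

```latex
\frac{2}{2^{64}} + \frac{2^{54} - 1}{2^{126}}
  < \frac{2}{2^{64}} + \frac{2^{54}}{2^{126}}
  = \frac{2}{2^{64}} + \frac{2^{-8}}{2^{64}}
  = \frac{2.00390625}{2^{64}}
  < \frac{2.004}{2^{64}}.
```

The polynomial-hash stage therefore contributes less than $0.004/2^{64}$ even at the maximal length of $2^{64}$ bytes.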

We further ensure that the result is XOR-universal over all strings: $P(h(s) = h(s') \oplus c) \le 1/2^{64}$ irrespective of whether $|s| = |s'|$. By an earlier lemma, it suffices to XOR the hash value with $k'' \star |M| \bmod p$ where $k''$ is a random 64-bit integer and $|M|$ is the string length as a 64-bit integer, and where $p = 2^{64} + 27$. The XOR universality follows for strings having different lengths by the same lemma and the equivalence between XOR-universality and $\Delta$-universality in binary finite fields. As a practical matter, since the final step involves the same modular reduction twice in the expression $(\mathrm{CLNH}(s) \bmod p) \oplus ((k'' \star |M|) \bmod p)$, we can simplify it to $(\mathrm{CLNH}(s) \oplus (k'' \star |M|)) \bmod p$, thus avoiding an unnecessary modular reduction.

Our analysis is summarized by the following lemma.

Lemma 9 CLHASH is $2.004/2^{64}$-almost XOR universal over strings of up to $2^{64}$ bytes. Moreover, it is XOR universal over strings of no more than 1 kB.

The bound of the collision probability of CLHASH for long strings ($2.004/2^{64}$) is 4 times lower than the corresponding VHASH collision probability ($1/2^{61}$). For short strings (1 kB or less), CLHASH has a bound that is 8 times lower. See Table 4 for a comparison. CLHASH is given by Algorithm 4.


Table 4: Comparison between the two 64-bit hash families VHASH and CLHASH

    family    universality                        input length
    VHASH     1/2^61-almost Δ-universal           1 to 2^59 bytes
    CLHASH    XOR universal                       1 to 1024 bytes
              2.004/2^64-almost XOR universal     1025 to 2^64 bytes


Algorithm 4 CLHASH algorithm: all operations are carry-less, as per § 4.3. The $\gg$ operator indicates a right shift: $O \gg 33$ is the value $O$ divided by $2^{33}$.

Require: 128 randomly picked 64-bit integers $k_1, k_2, \ldots, k_{128}$ defining a 128-bit CLNH hash function (see the CLNH equation) over inputs of length 128
Require: $k$, a randomly picked 126-bit integer
Require: $k'$, a randomly picked 128-bit integer
Require: $k''$, a randomly picked 64-bit integer
1: input: string $M$ made of $|M|$ bytes
2: if $|M| \le 1024$ then
3:     $O \leftarrow (\mathrm{CLNH}(M) \oplus (k'' \star |M|)) \bmod (2^{64} + 27)$
4:     return $O$
5: else
6:     Let $n$ be the number of 128-word blocks ($\lceil |M|/1024 \rceil$).
7:     Let $M_i$ be the substring of $M$ from index $128i$ to $128i + 127$ inclusively, padding with zeros if needed.
8:     Hash each $M_i$ using the CLNH function, labelling the resulting 128-bit results $a_i$ for $i = 1, \ldots, n$. That is, $a_i \leftarrow \mathrm{CLNH}(M_i)$.
9:     Hash the resulting $a_i$ with a polynomial hash function and store the value in a 128-bit hash value $O$: $O \leftarrow a_1 \star k^{n-1} \oplus \cdots \oplus a_n \operatorname{mod_{lazy}} (2^{127} + 2 + 1)$ (see the lazy modular reduction defined above).
10:    Hash the 128-bit value $O$, treating it as two 64-bit words $(O_1, O_2)$, down to a 64-bit CLNH hash value (with the addition of a term accounting for the length $|M|$ in bytes):
       $z \leftarrow (((O_1 \oplus k'_1) \star (O_2 \oplus k'_2)) \oplus (k'' \star |M|)) \bmod (2^{64} + 27)$.
       Values $k'_1$ and $k'_2$ are the two 64-bit words contained in $k'$.
11:    return the 64-bit hash value $z$
12: end if
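The short-input branch of Algorithm 4 (lines 2 to 4) can be sketched in portable C as follows. This is an illustration only, under several assumptions of ours: the input has already been split into zero-padded 64-bit words, a software routine replaces pclmulqdq, and the names `u128`, `clmul64`, `reduce` and `clhash_short` are hypothetical. It makes no claim to match the authors' implementation bit for bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 128-bit value stored as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } u128;

/* Software stand-in for pclmulqdq. */
u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0) r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* Reduction modulo p = 2^64 + 27 following Lemma 6. */
uint64_t reduce(u128 x)
{
    u128 z = clmul64(x.hi, 27);
    uint64_t y = clmul64(z.hi, 27).lo;   /* plays the role of Table 3 */
    return x.lo ^ z.lo ^ y;
}

/* Short-input branch: s holds nwords zero-padded 64-bit words covering
   nbytes <= 1024 bytes, k holds 128 key words, k2 is k'' from the text. */
uint64_t clhash_short(const uint64_t *s, size_t nwords, size_t nbytes,
                      const uint64_t *k, uint64_t k2)
{
    assert(nbytes <= 1024 && nwords <= 128);
    u128 acc = clmul64(k2, (uint64_t)nbytes);   /* k'' * |M| */
    size_t i = 0;
    for (; i + 1 < nwords; i += 2) {
        u128 t = clmul64(s[i] ^ k[i], s[i + 1] ^ k[i + 1]);
        acc.lo ^= t.lo;
        acc.hi ^= t.hi;
    }
    if (i < nwords) {   /* odd word count: pad with a single zero word */
        u128 t = clmul64(s[i] ^ k[i], k[i + 1]);
        acc.lo ^= t.lo;
        acc.hi ^= t.hi;
    }
    return reduce(acc);  /* a single reduction covers both terms */
}
```

Folding the length term into the accumulator before reducing reflects the simplification $(\mathrm{CLNH}(s) \oplus (k'' \star |M|)) \bmod p$ described in the text.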


5.1 Random Bits 

One might wonder whether using 1 kB of random bits is necessary. For strings of no more than 1 kB, CLHASH is XOR universal. Stinson showed that in such cases, we need the number of random bits to match the input length. That is, we need at least 1 kB to achieve XOR universality over strings having 1 kB. Hence, CLHASH makes nearly optimal use of the random bits.


6 Statistical Validation 

Classically, hash functions have been deterministic: fixed maps $h$ from $U$ to $V$, where $|U| \gg |V|$ and thus collisions are inevitable. Hash functions might be assessed according to whether their outputs are distributed evenly, i.e., whether $|h^{-1}(x)| \approx |h^{-1}(y)|$ for two distinct $x, y \in V$. However, in practice, the actual input is likely to consist of clusters of nearly identical keys [23]: for instance, symbol table entries such as temp1, temp2, temp3 are to be expected, or a collection of measured data values is likely to contain clusters of similar numeric values. Appending an extra character to the end of an input string, or flipping a bit in an input number, should (usually) result in a different hash value. A collection of desirable properties can be defined, and then hash functions rated on their performance on data that is meant to represent realistic cases.

One common use of randomized hashing is to avoid DoS (denial-of-service) attacks when an adversary controls the series of keys submitted to a hash table. In this setting, prior to the use of a hash table, a random selection of hash function is made from the family. The (deterministic) function is then used, at least until the number of collisions is observed to be too high. A high number of collisions presumably indicates the hash table needs to be resized, although it could indicate that an undesirable member of the family had been chosen. Those contemplating switching from deterministic hash tables to randomized hash tables would like to know that typical performance would not degrade much. Yet, as carefully tuned deterministic functions can sometimes outperform random assignments for typical inputs [23], some degradation might need to be tolerated. Thus, it is worth checking a few randomly chosen members of our CLHASH families against statistical tests.

6.1 SMHasher 

The SMHasher program includes a variety of quality tests on a number of minimally randomized hashing algorithms, for which we have weak or no known theoretical guarantees. It runs several statistical tests, such as the following.

- Given a randomly generated input, changing a few bits at random should not generate a collision.

- Among all inputs containing only two non-zero bytes (and having a fixed length in [4, 20]), collisions should be unlikely (called the TwoBytes test).

- Changing a single bit in the input should change half the bits of the hash value, on average (sometimes called the avalanche effect).

Some of these tests are demanding; e.g., CityHash [35] fails the TwoBytes test.

We added both VHASH and CLHASH to SMHasher and used the Mersenne Twister (i.e., MT19937) to generate the random bits. We find that VHASH passes all tests. However, CLHASH fails one of them: the avalanche test. We can illustrate the failure. Consider that for short fixed-length strings (8 bytes or less), CLHASH is effectively equivalent to a hash function of the form $h(x) = a \star x \bmod p$, where $p$ is irreducible. Such hash functions form an XOR universal family. They also satisfy the identity $h(x \oplus y) \oplus h(x) = h(y)$. It follows that no matter what value $x$ takes, modifying the same $i$-th bit modifies the resulting hash value in a consistent manner (according to $h(2^{i+1})$). We can still expect that changing a bit in the input changes half the bits of the hash value on average. However, SMHasher checks that $h(x \oplus 2^{i+1})$ differs from $h(x)$ in any given bit about half the time over many randomly chosen inputs $x$. Since $h(x \oplus 2^{i+1}) \oplus h(x)$ is independent from $x$ for short inputs with CLHASH, any given bit is either always flipped (for all $x$) or never. Hence, CLHASH fails the SMHasher test.
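The linearity behind this failure is easy to observe numerically. The sketch below uses a simple bit-serial multiplication in $GF(2^{64})$ modulo $p = 2^{64} + 27$ as a stand-in for short-input CLHASH. The functions `gfmul`, `h` and `flip_delta` are hypothetical names of ours, and bit positions are counted from zero, so flipping bit `i` XORs $2^i$ into the input.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplication in GF(2^64) modulo p = 2^64 + 27, one bit at a time. */
uint64_t gfmul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    while (b != 0) {
        if (b & 1) r ^= a;
        b >>= 1;
        uint64_t top = a >> 63;
        a <<= 1;              /* multiply a by 2 ...                  */
        if (top) a ^= 27;     /* ... and replace 2^64 by 27 (mod p)   */
    }
    return r;
}

/* h(x) = a * x mod p for a fixed key a. */
uint64_t h(uint64_t a, uint64_t x)
{
    return gfmul(a, x);
}

/* Change in the hash value when bit i of x is flipped.
   By linearity this equals h(a, 2^i), whatever x is. */
uint64_t flip_delta(uint64_t a, uint64_t x, unsigned i)
{
    assert(i < 64);
    return h(a, x ^ ((uint64_t)1 << i)) ^ h(a, x);
}
```

Because `flip_delta` does not depend on `x`, each output bit is either always or never flipped for a given key and input bit, which is exactly what the avalanche test penalizes.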

Thankfully, we can slightly modify CLHASH so that all tests pass if we so desire. It suffices to apply an additional bit mixing function taken from MurmurHash to the result of CLHASH. The function consists of two multiplications and three shifts over 64-bit integers:

$$
\begin{aligned}
x &\leftarrow x \oplus (x \gg 33), \\
x &\leftarrow x \times 18397679294719823053, \\
x &\leftarrow x \oplus (x \gg 33), \\
x &\leftarrow x \times 14181476777654086739, \\
x &\leftarrow x \oplus (x \gg 33).
\end{aligned}
$$

Each step is a bijection: e.g., multiplication by an odd integer is always invertible. A bijection does not affect collision bounds.
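The five steps translate directly to C, and the bijection claim can be checked by writing the inverse. In the sketch below, `mix64` follows the steps listed above, while `inverse_odd` and `unmix64` are our own additions for illustration: the inverse multipliers are computed by Newton iteration modulo $2^{64}$ rather than hard-coded.

```c
#include <assert.h>
#include <stdint.h>

/* The bit mixing function: two multiplications and three shifts. */
uint64_t mix64(uint64_t x)
{
    x ^= x >> 33;
    x *= 18397679294719823053ULL;
    x ^= x >> 33;
    x *= 14181476777654086739ULL;
    x ^= x >> 33;
    return x;
}

/* Inverse of an odd 64-bit integer modulo 2^64 by Newton iteration. */
uint64_t inverse_odd(uint64_t a)
{
    assert(a & 1);
    uint64_t inv = a;            /* correct to 3 bits: a * a = 1 mod 8  */
    for (int i = 0; i < 5; i++) {
        inv *= 2 - a * inv;      /* each step doubles the correct bits  */
    }
    return inv;
}

/* Inverse of mix64: undo the steps in reverse order. */
uint64_t unmix64(uint64_t x)
{
    x ^= x >> 33;                /* x ^= x >> 33 is its own inverse     */
    x *= inverse_odd(14181476777654086739ULL);
    x ^= x >> 33;
    x *= inverse_odd(18397679294719823053ULL);
    x ^= x >> 33;
    return x;
}
```

The XOR-shift step is self-inverse because shifting by 33 twice exceeds the 64-bit word size, so applying it a second time cancels the first application.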


7 Speed Experiments 

We implemented a performance benchmark in C and compiled our software using GNU GCC 4.8 with the -O2 flag. The benchmark program ran on a Linux server with an Intel i7-4770 processor running at 3.4 GHz. This CPU has 32 kB of L1 cache, 256 kB of L2 cache per core, and 8 MB of L3 cache shared by all cores. The machine has 32 GB of RAM (DDR3-1600 with double-channel). We disabled Turbo Boost and set the processor to run only at its highest clock speed, effectively disabling the processor's power management. All timings are done using the time-stamp counter (rdtsc) instruction [34]. Although all our software^ is single-threaded, we disabled hyper-threading as well.

^ Our benchmark software is made freely available under a liberal open-source license (https://github.com/lemire/StronglyUniversalStringHashing), and it includes the modified SMHasher as well as all the necessary software to reproduce our results.
Our experiments compare implementations of CLHASH, VHASH, SipHash, GHASH and Google's CityHash.

- We implemented CLHASH using Intel intrinsics. As described earlier, we use various single instruction, multiple data (SIMD) instructions (e.g., SSE2, SSE3 and SSSE3) in addition to the CLMUL instruction set. The random bits are stored consecutively in memory, aligned with a cache line (64 bytes).

- For VHASH, we used the authors' 64-bit implementation [25], which is optimized with inline assembly. It stores the random bits in a C struct, and we do not include the overhead of constructing this struct in the timings. The authors assume that the input length is divisible by 16 bytes, or padded with zeros to the nearest 16-byte boundary. In some instances, we would need to copy part of the input to a new location prior to hashing the content to satisfy the requirement. Instead, we decided to optimistically hash the data in-place without copy. Thus, we slightly overestimate the speed of the VHASH implementation, especially on shorter strings.

- We used the reference C implementation of SipHash. SipHash is a fast family of 64-bit pseudorandom hash functions adopted, among others, by the Python language.

- CityHash is commonly used in applications where high speed is desirable. We wrote a simple C port of Google's CityHash (version 1.1.1) [35]. Specifically, we benchmarked the CityHash64WithSeed function.

- Using Gueron and Kounavis' code, we implemented a fast version of GHASH accelerated with the CLMUL instruction set. GHASH is a polynomial hash function over $GF(2^{128})$ using the irreducible polynomial $x^{128} + x^7 + x^2 + x + 1$: $h(x_1, x_2, \ldots, x_n) = a^n x_1 + a^{n-1} x_2 + \cdots + a x_n$ for some 128-bit key $a$. To accelerate computations, Gueron and Kounavis replace the traditional Horner's rule with an extended version that processes input words four at a time: starting with $r = 0$ and precomputed powers $a^2, a^3, a^4$, compute $r \leftarrow a^4 (r + x_i) + a^3 x_{i+1} + a^2 x_{i+2} + a x_{i+3}$ for $i = 1, 5, \ldots, 4\lfloor m/4 \rfloor - 3$. We complete the computation with the usual Horner's rule when the number of input words is not divisible by four. In contrast with other hash functions, GHASH generates 128-bit hash values.
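The equivalence between the usual Horner's rule and the four-at-a-time variant can be checked in a smaller field. The sketch below is our own illustration and deliberately works in $GF(2^{64})$ modulo $2^{64} + 27$ instead of GHASH's $GF(2^{128})$, to keep the arithmetic within native 64-bit words. The functions `gfmul`, `horner` and `horner4` are hypothetical names and are not taken from Gueron and Kounavis' code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiplication in GF(2^64) modulo 2^64 + 27, one bit at a time. */
uint64_t gfmul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    while (b != 0) {
        if (b & 1) r ^= a;
        b >>= 1;
        uint64_t top = a >> 63;
        a <<= 1;
        if (top) a ^= 27;
    }
    return r;
}

/* Usual Horner's rule: r <- a * (r + x_i), yielding
   a^m x_1 + a^(m-1) x_2 + ... + a x_m (addition is XOR). */
uint64_t horner(uint64_t a, const uint64_t *x, size_t m)
{
    assert(x != NULL || m == 0);
    uint64_t r = 0;
    for (size_t i = 0; i < m; i++) {
        r = gfmul(a, r ^ x[i]);
    }
    return r;
}

/* Extended rule: four words per step with precomputed a^2, a^3, a^4. */
uint64_t horner4(uint64_t a, const uint64_t *x, size_t m)
{
    assert(x != NULL || m == 0);
    uint64_t a2 = gfmul(a, a), a3 = gfmul(a2, a), a4 = gfmul(a3, a);
    uint64_t r = 0;
    size_t i = 0;
    for (; i + 4 <= m; i += 4) {
        r = gfmul(a4, r ^ x[i]) ^ gfmul(a3, x[i + 1])
          ^ gfmul(a2, x[i + 2]) ^ gfmul(a, x[i + 3]);
    }
    for (; i < m; i++) {          /* leftover words: usual Horner */
        r = gfmul(a, r ^ x[i]);
    }
    return r;
}
```

The four multiplications inside each step are independent of one another, which is what lets a pipelined multiplier overlap them, whereas the usual rule forms a chain in which each multiplication waits for the previous one.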

VHASH, CLHASH and GHASH require random bits. 
The time spent by the random-number generator is excluded 
from the timings. 

7.1 Results 

We find that the hashing speed is not sensitive to the content of the inputs. Thus we generated the inputs using a random-number generator. For any given input length, we repeatedly hash the strings so that, in total, 40 million input words have been processed.

Table 5: A comparison of estimated CPU cycles per byte on a Haswell Intel processor using 64 B and 4 kB inputs. All schemes generate 64-bit hash values, except that GHASH generates 128-bit hash values.

    scheme      64 B input   4 kB input
    VHASH       1.0          0.26
    CLHASH      0.45         0.16
    CityHash    0.48         0.23
    SipHash     3.1          2.1
    GHASH       2.3          0.93

As a first test, we hashed 64 B and 4 kB inputs (see Table 5) and we report the number of cycles spent to hash one byte: for 4 kB inputs, we got 0.26 for VHASH, 0.16 for CLHASH, 0.23 for CityHash and 0.93 for GHASH. That is, CLHASH is over 60% faster than VHASH and almost 45% faster than CityHash. Moreover, SipHash is an order of magnitude slower. Considering that it produces 128-bit hash values, the CLMUL-accelerated GHASH offers good performance: it uses less than one cycle per input byte for long inputs.

Of course, the relative speeds depend on the length of the input. In Fig. 1, we vary the input length from 8 bytes to 8 kB. We see that the results for input lengths of 4 kB are representative. Mostly, we have that CLHASH is 60% faster than VHASH and 40% faster than CityHash. However, CityHash and CLHASH have similar performance for small inputs (32 bytes or less) whereas VHASH fares poorly over these same small inputs. We find that SipHash is not competitive in these tests.

7.2 Analysis 

From an algorithmic point of view, VHASH and CLHASH are similar. Moreover, VHASH uses a conventional multiplication operation that has lower latency and higher throughput than the carry-less multiplication used by CLHASH. And the VHASH implementation relies on hand-tuned assembly code. Yet CLHASH is 60% faster.

For long strings, the bulk of the VHASH computation is spent computing the NH function. When computing NH, each pair of input words (or 16 bytes) uses the following instructions: one mulq, three adds and one adc. Both mulq and adc generate two micro-operations (µops) each, so without counting register loading operations, we need at least $3 + 2 \times 2 = 7$ µops to process two words.* Yet Haswell processors, like other recent Intel processors, are apparently limited to a sustained execution of no more than 4 µops per cycle. Thus we need at least 7/4 cycles for every 16 bytes. That is, VHASH needs at least 0.11 cycles per byte. Because CLHASH runs at 0.16 cycles per byte on long strings (see Table 5), we have that no implementation of VHASH could surpass our implementation of CLHASH by more than 35%. Simply put, VHASH requires too many µops.

* For comparison, Dai and Krovetz reported that VHASH used 0.6 cycles per byte on an Intel Core 2 processor (Merom) [25].

Fig. 1: Performance comparison for various input lengths. (a) Cycles per input byte; (b) ratios of cycles per input byte. The horizontal axis of both panels is the data input size (bytes). For large inputs, CLHASH is faster, followed in order of decreasing speed by CityHash, VHASH, GHASH and SipHash.

CLHASH is not similarly limited. For each pair of input 64-bit words, CLNH uses two 128-bit XOR instructions (pxor) and one pclmulqdq instruction. Each pxor uses one (fused) µop whereas the pclmulqdq instruction uses two µops for a total of 4 µops, versus the 7 µops absolutely needed by VHASH. Thus, the number of µops dispatched per cycle is less likely to be a bottleneck for CLHASH. However, the pclmulqdq instruction has a throughput of only two cycles per instruction. Thus, we can only process one pair of 64-bit words every two cycles, for a speed of 2/16 = 0.125 cycles per byte. The measured speed (0.16 cycles per byte) is about 35% higher than this lower bound according to Table 5. This suggests that our implementation of CLHASH is nearly optimal, at least for long strings. We verified our analysis with the IACA code analyser. It reports that VHASH is indeed limited by the number of µops that can be dispatched per cycle, unlike CLHASH.

8 Related Work 

The work that led to the design of the pclmulqdq instruction by Gueron and Kounavis introduced efficient algorithms using this instruction, e.g., an algorithm for 128-bit modular reduction in Galois Counter Mode. Since then, the pclmulqdq instruction has been used to speed up cryptographic applications. Su and Fan find that the Karatsuba formula becomes especially efficient for software implementations of multiplication in binary finite fields due to the pclmulqdq instruction. Bos et al. used the CLMUL instruction set for 256-bit hash functions on the Westmere microarchitecture. Elliptic curve cryptography benefits from the pclmulqdq instruction. Bluhm and Gueron pointed out that the benefits are increased on the Haswell microarchitecture due to the higher throughput and lower latency of the instruction [8].

In previous work, we used the pclmulqdq instruction for fast 32-bit random hashing on the Sandy Bridge and Bulldozer architectures [26]. However, our results were disappointing, due in part to the low throughput of the instruction on these older microarchitectures.


9 Conclusion 

The pclmulqdq instruction on recent Intel processors enables a fast and almost universal 64-bit hashing family (CLHASH). In terms of raw speed, the hash functions from this family can surpass some of the fastest 64-bit hash functions on x64 processors (VHASH and CityHash). Moreover, CLHASH offers superior bounds on the collision probability. CLHASH makes optimal use of the random bits, in the sense that it offers XOR universality for short strings (less than 1 kB).

We believe that CLHASH might be suitable for many common purposes. The VHASH family has been proposed for cryptographic applications, and specifically message authentication (VMAC): similar applications are possible for CLHASH. Future work should investigate these applications.

Other microprocessor architectures also support fast carry-less multiplication, sometimes referring to it as polynomial multiplication (e.g., ARM and Power [20]). Future work might review the performance of CLHASH on these architectures. It might also consider the acceleration of alternative hash families such as those based on Toeplitz matrices.